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Abstract 

Many biological qnestions, inclnding the estimation of deep evolntionary his¬ 
tories and the detection of remote homology between protein seqnences, rely 
npon mnltiple seqnence alignments (MSAs) and phylogenetic trees of large 
datasets. However, accurate large-scale multiple sequence alignment is very 
difficult, especially when the dataset contains fragmentary sequences. We 
present UPP, an MSA method that uses a new machine learning technique 
- the Ensemble of Hidden Markov Models - that we propose here. UPP 
produces highly accurate alignments for both nucleotide and amino acid se¬ 
quences, even on ultra-large datasets or datasets containing fragmentary se¬ 
quences. UPP is available at https://github.com/smirarab/sepp, 
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Background 

Multiple sequence alignments of large datasets, containing several thou¬ 
sand to many tens of thousands of sequences, are used for gene family tree 
estimation for multi-copy genes (e.g., the p450 or 16S genes), the estimation 
of viral evolution, remote homology detection, the prediction of the contact 
map between proteins [1], and the inference of deep evolution [2]; however, 
most current MSA methods have poor accuracy on large datasets, especially 
when they evolved under high rates of evolution 

The difficulty involved in estimating accurate large multiple sequence 
alignments is a major limiting factor for phylogenetic analyses of datasets 
containing several hundred sequences or more. Phylogeny estimation meth¬ 
ods that do not require a multiple sequence alignment (e.g., truly alignment- 
free methods mum or almost alignment-free methods such as DACTAL 
0 ) can be used, but alignments are necessary for the estimation of branch 
lengths, dates at internal nodes, the detection of selection, etc. Therefore, 
phylogeny estimation generally operates by using methods such as maximum 
likelihood (ML) on estimated multiple sequence alignments. ML phylogeny 
estimation on datasets containing thousands [8j to tens of thousands |9] of 
sequences is now feasible, but the accuracy of ML trees depends on having 
accurate multiple sequence alignments HD], and estimating highly accurate 
large-scale alignments is extremely challenging; indeed, some datasets with 
only 1,000 sequences can be difficult to align with high accuracy mm- 

Another challenge confronting multiple sequence alignment methods is 
the presence of fragmentary sequences in the input dataset (see Fig. for 
examples of sequence length heterogeneity found in the biological datasets 
used in this study), which can result from a variety of causes, including the 
use of next generation sequencing technologies that can produce short reads 
that cannot be successfully assembled into full-length sequences. 

We present a statistical MSA method that uses a new machine learning 
technique that we will introduce - the Ensemble of Hidden Markov Models 
(HMMs) - to address these limitations. Each ensemble of HMMs is best 
seen as a collection of prohle HMMs for representing a multiple sequence 
alignment, constructed in a phylogeny-aware manner; hence, we refer to this 
method as UPP, for Ultra-large alignments using Phylogeny-aware Profiles. 

UPP uses the HMMER [13] suite of tools (see Methods) to produce an 
alignment, and builds on ideas in SEPP [H]. The basic idea behind UPP 
is to estimate an accurate alignment on a subset of the sequences and align 
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Sequence Length 


Figure 1; Histogram of sequence lengths for four of the biologi¬ 
cal datasets included in this study. These datasets show substantial 
sequence length heterogeneity and contain a mix of full-length and fragmen¬ 
tary sequences. 
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the remaining sequences to the alignment using prohle Hidden Markov Mod¬ 
els ng. UPP has four phases (see Fig. [^. Phase 1 begins with unaligned 
sequences and selects a subset (called the “backbone dataset”) of the se¬ 
quences; the remaining sequences are the “query sequences”. Phase 2 uses 
PASTA [inillTj to compute a multiple sequence alignment and maximum like¬ 
lihood tree (which is unrooted) on the backbone sequences; these are called 
the “backbone alignment” and “backbone tree”, respectively. As PASTA is 
a global alignment method and is not designed for the alignment of fragmen¬ 
tary sequences, UPP preferentially selects the backbone sequences from those 
that are considered to be full-length. In order to determine which sequences 
are “full-length”, UPP only includes backbone sequences within 25% of the 
length of the typical sequence for the given locus. In the case where the 
typical length of the locus is not known, we use the median length of the 
input sequences as an estimate of the average length for the locus. 


Unaligned sequences (input) 


Remaining sequences 
(query set) 


Randomly selected 
backbone 


I 


Backbone alignment and tree 





Figure 2: Overview of the UPP algorithm. The input is a set of aligned 
sequences. This sequence dataset is split into two parts, one the backbone 
dataset and the other the set of query sequences. An alignment and tree are 
estimated for the backbone dataset, and an ensemble of HMMs is constructed 
based on the backbone alignment and tree. The query sequences are then 
aligned to each HMM, and the best scoring HMM for each sequence is used 
to add the query sequence to the backbone alignment. See text for more 
details. 

This part of the UPP’s algorithmic design is similar to alignment methods 
that are based on seed alignments (e.g., the technique used in Infernal [IB]), 
but there is a basic difference between using seed alignments and these back- 
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bone alignments estimated by PASTA. Seed alignments are pre-computed 
alignments that are typically highly curated, and may be based on exper¬ 
imentally verihed structural features of the molecule. UPP does not need 
to have such seed alignments, and instead is an entirely de novo alignment 
method. 

Phase 3 creates a collection of HMMs (called the “Ensemble of HMMs”) 
using the backbone alignment and backbone tree. The process begins by 
including the HMM computed on the entire backbone alignment. Next, the 
backbone tree is decomposed by removing a centroid edge (i.e., an edge that 
splits the tree into two subtrees of approximately equal size). For each of 
these two unrooted subtrees, we use hmmbuild (a command within HMMER) 
to compute an HMM on the backbone alignment restricted to the sequences in 
the leaf set of the subtree, and then add the resulting HMM to the ensemble. 
We repeat this decomposition process until each subtree contains at most 
10 sequences. Thus, this process results in an ensemble of HMMs, each 
computed on an alignment induced by the backbone alignment on one of 
the subtrees. Note also that while the subtrees are local regions within the 
backbone tree, they may not be clades within the tree (for example, in Fig. 
HMM 5 is not based upon a clade). By default hmmbuild combines nearby 
sites with more than 50% gaps into a single match state, making it impossible 
to form a one-to-one mapping between the match states and the gappy sites 
in the original subset alignment. We modify the hmmbuild options to create 
a match state for each site that has at least one non-gap character, thus 
making it trivial to map the match states back to the original sites in the 
subset alignment. 

Phase 4 inserts the remaining query sequences into the backbone align¬ 
ment, as follows. The £t of each query sequence to each HMM is assessed 
using hmmsearch (a command within HMMER); this returns a bit score, 
which is a measure of the quality of the match between the query sequence 
and the HMM. The subset HMM with the best bit score is selected, and the 
sequence is inserted into the subset alignment using hmmalign (a command 
within HMMER). We treat each site within an alignment as a statement of 
positional homology, so that all letters within the site are considered to be 
positionally homologous [19]. Since positional homology is an equivalence 
relation (i.e., a binary relation that is reflexive, symmetric, and transitive), 
by transitivity, this process dehnes how the query sequence should be added 
into the backbone alignment; similar uses of transitivity have been used in 
other multiple sequence alignment methods [ 201 E]. When the sequence has 


5 


a letter (nucleotide or amino acid) that is not aligned to any letter in the 
backbone alignment, the extended alignment will have an “insertion site”. 

Once all the query sequences are added into the backbone alignment, 
transitivity dehnes the hnal output multiple sequence alignment. This ap¬ 
proach will tend to have potentially many insertion sites; in order to save 
space, we combine adjacent insertion sites into a single column. These in¬ 
troduced columns therefore contain nucleotides or amino acids that are not 
homologous to each other, and so the columns are indicated as insertion sites 
and masked before running a phylogenetic analysis. We also do not con¬ 
sider the homologies within these columns when evaluating the accuracy of 
computed alignments. 

As we will show, UPP provides very good accuracy on both phylogenetic 
and structural benchmarks, and is fast enough to produce highly accurate 
alignments on 10,000 sequences in under an hour, and on one million se¬ 
quences in twelve days, using only 12 cores. Furthermore, UPP has excellent 
accuracy even when the sequence dataset contains a large number of highly 
fragmentary sequences. In comparison, most other multiple sequence align¬ 
ment methods either cannot analyze datasets of the same size due to compu¬ 
tational limitations, or do not exhibit the same accuracy as UPP under the 
most challenging conditions (large datasets with fragmentary sequences). 

Results and discussion 

We used a variety of simulated and biological datasets from prior pub¬ 
lications to compare UPP to existing multiple sequence alignment methods 
(see Methods for details). The simulated datasets include ROSE NT: a 
collection of 1000-sequence nucleotide datasets; Indelible lOK: a collection 
of 10,000-sequence nucleotide datasets; RNASim: a collection of datasets 
ranging from 10,000 sequences to 1,000,000 sequences; and ROSE AA: a 
collection of 5000-sequence simulated AA datasets. The biological datasets 
include CRW: the three largest datasets (16S.3, 16S.T, and 16S.B.ALL) 
from the Comparative Ribosomal Website [21] with up to 27,643 sequences; 
10 AA: ten amino acid datasets with curated multiple sequence alignments 
with up to 807 sequences; and HomFam: 19 large HomFam datasets [22] . 
with up to 93,681 sequences. For some of these datasets, we generated frag¬ 
mented versions, making 12.5%, 25%, and 50% of the sequences fragmentary, 
in order to evaluate robustness to fragmentary data. The simulated datasets 
have true alignments and trees available from the prior publications. The 
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biological datasets have reference alignments based on structural features, 
and the CRW and lOAA datasets also have reference trees computed using 
the reference alignments, which are also available from the prior publications. 
The reference alignments for the HomFam datasets are too small (5-20 se¬ 
quences, median 7) and trees computed on these reference alignments were 
too poorly supported to be useful for evaluation purposes. 

We computed ML trees on the estimated alignments, and report tree 
error using the false negative (FN) rate (also known as the missing branch 
rate), and the AFN rate, which is the difference between the FN rates of 
trees computed on estimated and true or reference alignments. We report 
alignment SP-error, which is the average of the sum-of-pairs false negative 
(SPFN) and false positive (SPFP) rates [19]. We also report the total column 
score (TC), which is percentage of aligned columns (i.e., columns with at 
least one homology) in the true or reference alignment that appear in the 
estimated multiple sequence alignment. 


UPP algorithm design. . We explored modihcations of the UPP design in 
which we varied the backbone size, used a single HMM instead of an ensemble, 
built ensembles based on clades within the backbone tree, built ensembles 
based on disjoint subsets of ten sequences each, used different MSA methods 
to compute the backbone alignment, used MAFFT instead of hmmalign to 
add sequences to the backbone alignment, and ran hmmbuild using different 
options to compute HMMs on each subset alignment. These preliminary 
studies revealed the following trends. 

(1) Using small backbones (100 sequences) rather than large backbones 
(1000 sequences) typically produced higher alignment SP-error rates and tree 
error rates for both the Ensemble of HMMs approach and the single HMM 
approach (SOM Section S2.1). Using smaller backbones reduced the running 
time for the Ensemble of HMMs approach and had negligible impact on the 
running time for the single HMM approach (SOM Section S2.1). 

(2) Using an ensemble of HMMs rather than a single HMM with 1000- 
sequence backbones had varying impact. As shown in Table [T| the impact 
on alignment SP-error ranged from neutral (changes of at most 0.3% for 
alignment SP-score or tree error) to benehcial; for example, using an ensemble 
of HMMs had 23.0% alignment SP-error on the HomFam datasets whereas 
using a single HMM produced alignment SP-error of 25.4% (Table [^. The 
impact on TC score also varied: TC scores were better using single HMMs for 
the Indelible simulated datasets, and were otherwise better using ensembles 
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(Table [^. The differences in TC score were generally small (e.g., the average 
difference was less than 0.5%). On the HomFam datasets, using an ensemble 
of HMMs had TC score of 46.6% while a single HMM had TC score of 
44.5% (a difference of 2.1%) and on the Indelible 10000M4 datasets using 
a single HMM had TC score of 30.5% and using an ensemble of HMMs 
had 27.4% (a difference of 3.1%). Finally, the use of an ensemble of HMMs 
instead of a single HMM generally reduced tree error (Table [^. For example, 
results on the CRW datasets show that using an ensemble of HMMs had 
average tree error of 7.8%, but using a single HMM had average tree error of 
16.5% (i.e., more than double the tree error). Substantial reductions in tree 
error were also observed for the RNASim lOK datasets. In a few cases (i.e., 
the ROSE AA and Indelible datasets), using a single HMM improved tree 
error, but the differences were very small (Table [^. The impact of using an 
ensemble of HMMs instead of a single HMM was lessened for 100-sequence 
backbones, and in some cases even led to small improvements (SOM Section 


S2.1). However, the best results were still obtained using the 1000-sequence 


backbones with the ensemble of HMMs. 

(3) Using ensembles of HMMs computed for clades within the backbone 
tree produced alignments and trees that were generally as accurate (accord¬ 
ing to the SP-error and tree error rates) and had variable impact on TC 
scores (generally reducing scores but in some cases improving them) as those 
produced using ensembles based on the centroid-edge decompositions (SOM 
Section S2.1 and SOM Table SI). However, UPP using clade-based ensem¬ 
bles took more time (SOM Section S2.1). 

(4) Using ensembles of HMMs based on disjoint subsets (each with at most 
10 sequences) had variable impact. For many datasets (e.g., the ROSE AA, 
RNASim, CRW, and HomFam datasets) the impact of using disjoint subsets 


was very small, and in some cases even slightly favorable (SOM Section S2.1 


and SOM Table SI). However, for some other datasets using disjoint subsets 
greatly reduced accuracy. For example, on the Indelible 10000M2 datasets, 
default UPP had alignment SP-error of 3.5%, TC score 1.2%, and AFN 
error of 0.6%, but using disjoint subsets had SP-error of 28.2%, TC score 
0.3%, and AFN tree error of 19. 


(SOM Table [Sl|). Thus, although using 
disjoint ensembles of HMMs reduced the running time (SOM Section S2.1), 
the default ensemble of HMMs was a more reliable technique than ensembles 
based on disjoint subsets. 

(5) The technique used to estimate the backbone alignment had a large 
impact on the final alignment and tree (SOM Section S2.3), so that the final 











alignment SP-error very closely matched the initial backbone alignment SP- 
error (SOM Section S2.4). Hence, the best alignment methods are needed to 
produce the backbone alignment. 

(6) Using MAFFT to add sequences to the backbone alignment instead of 
UPP’s default technique (hmmalign, a command within HMMER) reduced 
accuracy (SOM Section S2.5). 

(7) Using different hmmbuild options (such as turning off the entropy¬ 
weighting flag) did not improve accuracy (SOM Section S2.7). 

Overall, the most reliable results were obtained using large backbones 
(1000 sequences), using an ensemble of HMMs, computing the backbone us¬ 
ing PASTA, and using hmmalign to add sequences into the backbone align¬ 
ment. These settings were used for the default version of UPP. However, for 
running-time purposes (so that ultra-large datasets can be analyzed quickly), 
we explore UPP (Fast), the variant of UPP that uses backbones of 100 se¬ 
quences but otherwise uses all the default settings (i.e., restrict the backbone 
to full-length sequences, use an ensemble of HMMs, use PASTA to align 
subsets, etc.). 


Comparison to other MSA methods on full-length sequences. . We used 
Clustal-Omega [22], MAFFT [23], Muscle [23], PASTA [THl HZ], and UPP 
to compute multiple sequence alignments. 

We rank methods by tiers, where the hrst tier contains the method that 
had the best performance as well as any other method that was within 1% of 
the best result on the dataset. Similarly, the second tier contains the method 
not in the hrst tier that had the best performance, and all methods within 1% 
of that method (and so forth for the remaining tiers). The method that had 
the best performance overall within a collection is also identihed. We describe 
the general performance of each method on the full-length datasets (Table 
1^ and fragmentary datasets (Table [^. For the fragmentary results, we take 
the average performance of each method on the entire range of fragmented 
datasets. 

The majority of experiments were run on the homogeneous Lonestar clus¬ 
ter at the Texas Advanced Computing Center (TACC). Because of limitations 
imposed by Lonestar, these analyses are limited to 24 hours, using 12 cores 
with 24 GB of memory; methods that failed to complete within 24 hours or 
terminated with an insufficient memory error message were marked as fail¬ 
ures. For experiments on the million-sequence RNASim dataset, we ran the 
methods on a dedicated machine with 256 GB of main memory and 12 cores 
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Table 1; Comparison of two UPP variants on representative full- 
length datasets with respect to alignment SP-error, tree error, and 
TC scores. All criteria (errors and scores) given as percentages. See text 
for explanation of names of methods and computational platforms used. The 
default setting for UPP is denoted UPP (Default); it uses a backbone of size 
1000, uses PASTA to compute the backbone alignment, and the ensemble of 
HMMs technique. The “NoDecomp” versions of these two methods replace 
the ensemble of HMMs technique with a single HMM. Maximum likelihood 
trees are estimated using RAxML (on the 10 AA datasets) or Fast Tree (all 
other datasets) except for HomFam, where we do not estimate ML trees as 
there are no reference trees for the HomFam datasets. 


Model condition 

Method 

Alignment SP-crror 

AFN 

TC score 

10 AA 

UPP(Default) 

24.2 

3.4 

11.4 

10 AA 

UPP(Default,No Deconip) 

24.5 

5.2 

11.0 

ROSE AA 

UPP(Default) 

2.9 

1.8 

2.6 

ROSE AA 

UPP(Default,No Deconip) 

2.8 

1.4 

2.5 

CRW 

UPP(Default) 

12.5 

7.8 

1.4 

CRW 

UPP(Default,No Deconip) 

13.3 

16.5 

0.9 

HomFani(19) 

UPP(Default) 

23.0 

NA 

46.6 

HomFani(19) 

UPP(Default,No Deconip) 

25.4 

NA 

44.5 

Indel. 10000M2 

UPP(Default) 

3.5 

0.6 

1.2 

Indel. 10000M2 

UPP(Default,No Deconip) 

3.3 

0.5 

1.4 

Indel. 10000M3 

UPP(Default) 

1.3 

0.2 

4.6 

Indel. 10000M3 

UPP(Default,No Deconip) 

1.3 

0.1 

4.8 

Indel. 10000M4 

UPP(Default) 

0.3 

<0.0 

27.4 

Indel. 10000M4 

UPP(Default,No Deconip) 

0.5 

<0.0 

30.5 

RNASini lOK 

UPP(Default) 

9.5 

0.8 

0.5 

RNASini lOK 

UPP(Default,No Deconip) 

11.2 

3.0 

0.3 
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and ran nntil an alignment was generated or the method failed. We also 
performed a limited nnmber of experiments on TACC with UPP’s internal 
checkpointing mechanism, to explore performance when time is not limited. 
All methods other than Mnscle had parallel implementations and were able 
to take advantage of the 12 available cores. 

On fnll-length datasets (Table where nearly all methods were able to 
complete, PASTA was nearly always in the hrst tier with respect to alignment 
SP-error, tree error, and TC scores (the only exceptions being the RNASim 
lOK datasets, where PASTA came in second tier for alignment SP-error and 
the HomFam(17) datasets where PASTA came in second tier for TC score). 
UPP(Defanlt) had the second best performance: in the hrst tier in terms 
of SP-error except for the Indelible lOK and HomFam(2) datasets, where 
it was in the second tier (with 1.2% and 3.4% higher error than the best 
method), in hrst or second tier for tree error, and in the hrst throngh third 
tiers for TC score. MAFFT was in third place, placing in the hrst throngh 
third tiers for alignment SP-error, hrst throngh third tiers for tree error, and 
hrst throngh fourth tiers for TC scores. Muscle and Clustal-Omega came in 
behind MAFFT. Muscle came in second through fourth tiers with respect to 
alignment SP-error, hrst through fourth tiers with respect to tree error, and 
second through fourth tiers with respect to TC score. Clustal-Omega came in 
hrst through fourth tiers with respect to alignment SP-error, second through 
fourth tiers with respect to tree error, and hrst through fourth tiers with 
respect to TC scores. In general, the relative performance of Muscle and 
Clustal-Omega seemed to depend on the type of data, with Muscle doing 
better on the nucleotide datasets and Clustal-Omega doing better on the 
amino acid datasets. 

Thus, for full length sequences, whether with respect to alignment SP- 
error, tree error, or TC scores, on average PASTA came in hrst, UPP came 
in second, MAFFT came in third, and Muscle and Clustal-Omega came in 
behind these methods. 

Comparison to other methods on datasets with fragmentary sequences. . We 
next investigated performance on datasets with fragmentary sequences. As 
shown in Table UPP was in the hrst tier of methods on all the fragmentary 
datasets with respect to alignment SP-error, and in the hrst tier of methods 
for three of the four collections (except for CRW) with respect to tree error, 
where it is in the second tier. PASTA was not in the hrst tier for any 
collection with respect to either criterion, and was instead in the second 
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Table 2: Average alignment SP-error, tree error, and TC score 
across most full-length datasets. We report the average alignment 
SP-error (the average of SPFN and SPFP error) (top), average AFN error 
(middle), and average TC score (bottom), on the collection of full-length 
datasets. All scores represent percentages, and so are out of 100. Results 
marked with an “X” indicate that the method failed to terminate within the 
time limit (24 hours on a 12 core machine). Muscle failed to align two of the 
HomFam datasets; we report separate average results on the 17 HomFam 
datasets for all methods and the two HomFam datasets for all but Muscle. 
We did not test tree error on the HomFam datasets (therefore, the AFN 
error is indicated by “NA”). The tier ranking for each method is shown 
parenthetically. 


Method 

ROSE 

NT 

RNASim 

lOK 

Indelible 

lOK 

ROSE 

AA 

CRW 

10 AA 

HomFam 

(17) 

HomFam 

(2) 

Average Alignment SP-Error 

UPP 

7.8 (1) 

9.5 (1) 

1.7 (2) 

2.9 (1) 

12.5 (1) 

24.2 (1) 

23.3 (1) 

20.8 (2) 

PASTA 

7.8 (1) 

15.0 (2) 

0.4 (1) 

3.1 (1) 

12.8 (1) 

24.0 (1) 

22.5 (1) 

17.3 (1) 

MAFFT 

20.6 (2) 

25.5 (3) 

41.4 (3) 

4.9 (2) 

28.3 (2) 

23.5 (1) 

25.3 (2) 

20.7 (2) 

Muscle 

20.6 (2) 

64.7 (5) 

62.4 (4) 

5.5 (3) 

30.7 (3) 

30.2 (2) 

48.1 (4) 

X 

Clustal 

49.2 (3) 

35.3 (4) 

X 

6.5 (4) 

43.3 (4) 

24.3 (1) 

27.7 (3) 

29.4 (3) 

Average AFN Error 

UPP 

1.3 (1) 

0.8 (1) 

0.3 (1) 

1.8 (1) 

7.8 (2) 

3.4 (2) 

NA 

NA 

PASTA 

1.3 (1) 

0.4 (1) 

<0.1 (1) 

1.3 (1) 

5.1 (1) 

3.3 (1) 

NA 

NA 

MAFFT 

5.8 (2) 

3.5 (2) 

24.8 (3) 

4.5 (3) 

10.1 (3) 

2.3 (1) 

NA 

NA 

Muscle 

8.4 (3) 

7.3 (3) 

32.5 (4) 

3.1 (2) 

5.5 (1) 

12.6(3) 

NA 

NA 

Clustal 

24.3 (4) 

10.4 (4) 

X 

4.2(3) 

34.1 (4) 

3.5 (2) 

NA 

NA 

Average TC score 

UPP 

37.8 (1) 

0.5 (2) 

11.0 (3) 

2.6 (2) 

1.4 (1) 

11.4 (1) 

47.3 (1) 

40.3 (3) 

PASTA 

37.8 (1) 

2.3 (1) 

48.0 (1) 

5.4 (1) 

2.3 (1) 

12.1 (1) 

46.1 (2) 

50.0 (1) 

MAFFT 

31.4 (2) 

0.4 (2) 

7.8 (4) 

0.6 (3) 

0.7 (2) 

12.1 (1) 

45.5 (2) 

46.9 (2) 

Muscle 

9.8 (3) 

<0.0 (2) 

18.3 (2) 

2.7 (2) 

0.7 (2) 

10.5 (2) 

27.7(4) 

X 

Clustal 

5.7 (4) 

0.2 (2) 

X 

3.1 (2) 

0.1 (2) 

11.8 (1) 

38.6 (3) 

31.0 (4) 
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through fourth tiers for alignment SP-error and second and third tiers for 
tree error. MAFFT was in the second and third tiers for alignment SP-error, 
but did reasonably well for tree error: in the hrst tier for CRW and otherwise 
second and third tiers. As before, Muscle and Clustal-Omega did less well 
than the other methods: in the third through hfth tiers, and Clustal-Omega 
was unable to analyze at least one dataset. Note also that the absolute error 
generally increased, and that only UPP had reasonably low alignment SP- 
error and tree error across all these fragmentary datasets. Thus, the relative 
and absolute performance of methods changed between the full-length and 
fragmentary data. 

To examine the impact of fragmentation in detail, see Figure which 
shows results on the ROSE NT 1000M2 (a very challenging condition due 
to high rates of indels and substitutions), with varying levels of fragmen¬ 
tation. UPP’s alignment SP-error increased only slightly with increases in 
fragmentation, even up to the highest degree of fragmentation (50%). All 
other methods exhibited greater increases in alignment SP-error or tree error 
than UPP, as the amount of fragmentation increased. 


Table 3: Average alignment SP-error and tree error across fragmen¬ 
tary datasets. We report the average alignment error (top) and average 
AFN error (bottom) on the collection of fragmentary datasets. Clustal- 
Omega failed to align any of the Indelible 10000M2 fragmentary datasets 
and thus we mark the results with an “X”. The tier ranking for each method 
is shown in parentheses. 


Method 

ROSE NT 

RNASim lOK 

Indelible lOK 

CRW 

(16S.3 and 16S.T) 

Average Alignment SP-Error 

UPP 

8.3 (1) 

11.8 (1) 

2.7 (1) 

16.1 (1) 

PASTA 

25.2 (2) 

47.7 (4) 

8.8 (2) 

23.3 (2) 

MAFFT 

32.5 (3) 

25.5 (2) 

51.3 (3) 

24.5 (3) 

Muscle 

35.3 (4) 

82.2 (5) 

77.6 (4) 

70.6 (5) 

Clustal 

62.0 (5) 

35.0 (3) 

X 

46.7 (4) 

Average AFN Error 

UPP 

1.9 (1) 

3.1 (1) 

2.5 (1) 

7.4 (2) 

PASTA 

25.2 (3) 

21.9 (3) 

9.0 (2) 

8.2 (2) 

MAFFT 

18.0 (2) 

6.2 (2) 

35.6 (3) 

2.5 (1) 

Muscle 

27.5 (4) 

43.6 (5) 

45.2 (4) 

30.1 (3) 

Clustal 

47.8 (5) 

26.3 (4) 

X 

37.4 (4) 


To better understand why UPP is robust to fragmentation, we explored 
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[-•- Clustal-Omega MAFFT |-h- UPP(Default) 
[^Muscle 1+ PASTA 


(a) Alignment error on fragmentary ROSE NT 1000M2 datasets 



RCIustal-OmegaEMAFFT -B- UPP(Default) 
gMuscle [+PASTA 


(b) AFN error on fragmentary ROSE NT 1000M2 datasets 

Figure 3: Impact of fragmentary sequences on alignment SP-error 
and tree error. We show alignment and tree error rates for different 
methods on the ROSE NT 1000M2 datasets, but include results where a per¬ 
centage of the sequences are made fragmentary, varying the percentage from 
0% to 50%. Fragmentary sequences have average length 500 (i.e., roughly 
half the average sequence length for ROSE 1000M2). 
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UPP variants (called UPP-random) in which we did not constrain the back¬ 
bone to be only fnll-length seqnences. We also looked at whether the nse of 
the ensemble of HMMs instead of a single HMM contribntes to robnstness to 
fragmentation. These comparisons (Fig. revealed some interesting trends 
abont the impact of these algorithm design parameters. First, the only UPP 
variants that were able to align all the datasets were the two that nsed the 
ensemble of HMMs; the variants that nsed a single HMM each failed to align 
several datasets becanse HMMER was not able to align some of the qnery se¬ 
qnences to the backbone alignment (Fig.|^. Second, the comparison between 
UPP-random(Defanlt) and UPP(Defanlt)) favored UPP(Defanlt), so that 
while there were negligible to small differences in some cases, UPP(Defanlt) 
was dramatically more accnrate than UPP-random(Defanlt) on the ROSE 
NT datasets for both alignment SP-error and tree error (Fig. |^. Thns, 
restricting the backbone to fnll-length seqnences is a very important con- 
tribntion to robnstness to fragmentary seqnences. However, restricting the 
backbone to fnll-length seqnences and nsing only a single HMM prodnced 
mnch higher tree error than nsing an ensemble of HMMs (Fig. |^, showing 
that nsing an ensemble of HMMs also provides benehts. Thns, the two al¬ 
gorithmic techniqnes (restricting the backbone to fnll-length seqnences, and 
using an ensemble of HMMs) are both useful to improving robustness to 
fragmentary sequences, but they address different analytical challenges. 

Impact of taxon sampling. . We evaluated the ability for different methods to 
analyze very large datasets (up to one million sequences), using subsets of the 
million-sequence RNASim dataset; this comparison also reveals the impact 
of taxon sampling on the alignment methods. We examined performance 
for UPP (Fast), the fast version of UPP that differs from the default setting 
of UPP only in that it uses smaller backbones (100 sequences instead of 
1000). Figure]^ shows results for 10,000 to 200,000 sequences, and compares 
UPP(Fast), UPP(Default), PASTA, MAFFT, Muscle, and Clustal-Omega, 
limiting analyses to 24 hours on a 12-core 24 Gb machine. While all methods 
shown were able to complete analyses on the lOK dataset, only UPP(Fast) 
and PASTA completed analyses on the lOOK and 200K datasets. 

As the number of sequences in the RNASim datasets increased, PASTA’s 
alignment SP-error dropped from 15.0% at 50,000 sequences to 12.2% at 
200,000 sequences. UPP(Fast) had stable alignment SP-error across all the 
datasets, varying between 12.5 to 13.3%. Both UPP and PASTA trees im¬ 
proved with increased taxon sampling, with PASTA trees approaching the 
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ROSE NT RNASimlOK Indel. 10K CRW 

l7IUPP-random(Default,No Decomp)0UPP(Defaull,No Decomp) 0UPP-i-andom(Default)0UPP(Default) 


(a) Alignment error 



[71UPP-random(Delault,No Decomp)^UPP(Default,No Decomp) |7iUPP-random(Defaull)[/1UPP(Delault) 

(b) Delta FN tree error 

Figure 4: Average alignment SP-error and tree error of 

UPP(Default), UPP(Default, NoDecomp), UPP-random(Default), 
and UPP-random(Default, NoDecomp) on the fragmentary 
datasets. UPP-random does not restrict the backbone to full length se¬ 
quences, and so allows fragmentary sequences to be in the backbone set. 
UPP-random(Default, NoDecomp) failed to align at least one dataset from 
each of the RNASim lOK, Indelible lOK, and CRW model conditions. 
UPP(Default, NoDecomp) failed to align at least one dataset from each of the 
ROSE NT, RNASim lOK, and Indelible lOK model conditions. Maximum 
likelihood trees were estimated using Fast Tree under GTR. 
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UPP(Fast) 

UPP(Default) 


(a) Alignment error on RNASim datasets with lOK to 200K sequences 
0.3-- 


: 0.1 - 


10K 50K 100K 


50K 100K 200K 

Number of sequences 

Clustal-OmegaHMAFFT0UPP(Fast) 


^Muscle ^PASTA l^UPP(Default) 

(b) FN tree error on RNASim datasets with lOK to 200K sequences 



MCIustal-OmegaHMAFFTMUPP(Fast) 

0Muscle 0PASTA |2UPP(Default) 

(c) AFN tree error on RNASim datasets with lOK to 200K sequences 


Figure 5: Alignment SP-error and tree error rates on RNASim 
datasets with up to 200K sequences. Results not shown are due 
to methods failing to return an alignment within the 24-hour time period 
on TACC, using 12 cores. Maximum likelihood trees were estimated using 
Fast Tree under GTR. 
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accuracy of maximum likelihood on the true alignment (0.1% to 0.2% AFN), 
and UPP trees close behind (1.2% to 1.4% AFN, Fig. [^c)). 

We then compared UPP (Fast) to PASTA on the full 1,000,000-sequence 
RNASim dataset. We ran UPP (Fast) and PASTA on a dedicated machine 
with 12 cores and 256 GB of memory so that the analyses could exceed the 
24 hour time limit in TACC. UPP (Fast) completed in 12 days, with align¬ 
ment and tree errors similar to previous results (12.8% alignment SP-error 
and 2.0% AFN). PASTA completed in 15 days, and produced a much worse 
alignment but better tree (18.5% alignment SP-error and 0.4% AFN). Be¬ 
cause we used a different machine with a different architecture, the running 
times on the 1,000,000-sequence RNASim dataset cannot be directly com¬ 
pared to the running times on the other RNASim datasets, which were run 
on TACC. 

Computational issues. . Table compares wall clock running times, using 12 
cores, on those datasets where all methods were able to complete within the 
24 hours limitation on Lonestar; thus, we show results on all datasets except 
for the RNASim datasets with 50K or more sequences. Note that all methods 
but Muscle had parallel implementations and were able to take advantage of 
the 12 available cores; the relative performance differences between methods 
could greatly differ on a single core machine, depending on how well each 
method is able to take advantage of parallelism. 

The differences in average running time on these datasets were some¬ 
times small (e.g., all methods completed analyses using between 0.4 to 0.6 
hours wall clock time for the ROSE NT datasets with 1000 sequences, and 
in less than 0.2 hours wall clock time for the 10 AA datasets with under 
1000 sequences). However, on the CRW datasets, which could be quite large 
(nearly 28K sequences), the differences in average running time were large: 
UPP(Default) used 11.6 hours. Muscle used 5.9 hours, PASTA used 3.2 hours, 
Clustal-Omega used 2.8 hours, and MAFFT used only 0.4 hours. Overall, 
on these datasets, MAFFT was generally the fastest (or nearly so), and 
UPP(Default) generally the slowest. 

We compared the wall clock running time for each stage of the UPP al¬ 
gorithm for UPP (Default) and UPP (Fast) on two large nucleotide datasets: 
the RNASim lOK dataset with 10,000 sequences and the CRW 16S.B.ALL 
dataset with 27,643 sequences (Table [^. Only two steps - computing the 
backbone alignment and tree and searching for the best HMM - used more 
than a few minutes, even on the largest dataset. Computing the backbone 
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Table 4: Wall clock running time across most full-length datasets. 

We report average wall clock running time on the full-length datasets for 
which most methods could complete; this includes everything other than the 
RNASim datasets with 50,000 or more sequences. UPP is run in default 
mode. Results marked with an “X” indicate that the method failed to ter¬ 
minate within the time limit (24 hours on a 12 core machine). All methods 
but Muscle had parallel implementations and were able to take advantage of 
the 12 cores. Muscle failed to align two of the HomFam datasets; we report 
separate average results on the 17 HomFam datasets for all methods and the 
two HomFam datasets for all but Muscle. 


Method 

ROSE 

NT 

RNASim 

lOK 

Indelible 

lOK 

ROSE 

AA 

CRW 

10 AA 

HomFam 

(17) 

HomFam 

(2) 

Average wall clock running time (hr) 

UPP 

0.6 

6.7 

6.7 

0.2 

11.6 

<0.1 

1.3 

0.5 

PASTA 

0.6 

3.9 

1.3 

0.2 

3.2 

0.2 

1.5 

1.3 

MAFFT 

0.4 

0.1 

1.4 

<0.1 

0.4 

0.1 

<0.1 

0.1 

Muscle 

0.5 

0.8 

1.2 

<0.1 

5.9 

0.2 

1.3 

X 

Clustal 

0.4 

4.8 

X 

0.1 

2.8 

<0.1 

0.3 

0.3 


alignment and tree took under an hour for UPP (Default) and under 8 min¬ 
utes for UPP (Fast). However, searching for the best HMM for the query 
sequences took the most time. For UPP (Default), which had 10 times as 
many HMMs as UPP(Fast), this step took nearly 16 hours on 16S.B.ALL 
and 7 hours on the RNASim lOK dataset, while UPP (Fast) used under 1.8 
hours on the 16S.B.ALL dataset and 0.8 hours on the RNASim lOK dataset. 
Thus, the vast majority of the time on large datasets is spent searching for 
the best HMM. On very small datasets, the running time difference beween 
UPP (Default) and UPP (Fast) will be small, but on very large datasets the 
running time differences will be substantially increased - close to an order of 
magnitude of difference in running time. 

We then explored how UPP’s running time (measured using wall clock 
time) scaled with the size of the dataset by exploring subsets of the RNASim 
dataset with 10,000 to 200,000 sequences, using 12 cores. Running times for 
UPP (Fast) on the RNASim datasets showed a close to linear trend, so that 
UPP(Fast) completed on lOK sequences in 55 minutes, on 50K sequences in 
4.2 hours, on lOOK sequences in about 8.5 hours, and on 200K sequences in 
about 17.8 hours (Fig. |^. 
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Table 5: Running time for UPP(Fast) and UPP(Default) on the 
RNASim lOK and CRW 16S.B.ALL datasets. We show the wall clock 
running time (hr) for each stage of UPP (Fast) and UPP (Default) on the 
RNASim lOK (10,000 sequences) and CRW 16S.B.ALL (27,643 sequences) 
datasets, two of the largest nucleotide datasets. The UPP alignments were 
computed on TACC’s Lonestar Cluster machine. The vast majority of the 
running time was spent searching for the best HMM for the query sequences. 




Wall clock running time (hr) 



RNASim lOK 

CRW 16S.B.ALL 

Stage 

UPP (Fast) 

UPP(Default) 

UPP (Fast) 

UPP (Default) 

Building Backbone 

0.12 

0.42 

0.13 

0.52 

Building HMMs 

<0.01 

0.02 

<0.01 

0.02 

Searching for best HMM 

0.83 

6.53 

1.81 

15.45 

Aligning sequences 

0.02 

0.03 

0.05 

0.15 

Merge alignments 

0.01 

0.01 

0.01 

0.02 

Total time: 

0.99 

7.01 

2.01 

16.16 



Sok 10OK 1MK 

Number of sequences 


• UPP(Fast) 

Figure 6: Running time for UPP(Fast) on the RNASim datasets. 

We show running time to generate an alignment for UPP (Fast) on RNASim 
datasets with lOK, 50K, lOOK, and 200K sequences. All analyses were run 
on TACC with 24 GB of memory and 12 CPUs. 
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Conclusions 


Although the relative performance of multiple sequence alignment meth¬ 
ods depended on the dataset, in most cases UPP produced alignments with 
lower SP-error rates and higher TC scores than MAFFT, Muscle, and Clustal- 
Omega, and maximum likelihood trees computed on UPP alignments were 
also more accurate than ML trees on these other alignments. However, the 
comparison between UPP and PASTA is more interesting. Because UPP 
uses PASTA to compute its backbone alignment and tree, by design, UPP 
is identical to PASTA on fragment-free datasets containing at most 1000 
sequences. The comparison between UPP and PASTA with respect to align¬ 
ment accuracy is interesting: UPP alignments tend to have lower SP-error 
rates than PASTA alignments but also lower TC scores, indicating that these 
two criteria are not that well correlated. However, ML trees based on PASTA 
alignments (for fragment-free datasets) are typically more accurate than ML 
trees based on UPP alignments. On datasets with fragmentary sequences, 
UPP has nearly the same SP-error rates that it achieves on the full-length 
sequences, while PASTA’s SP-error rates increase substantially with fragmen¬ 
tation; consequently, UPP’s AFN tree error rates do not tend to increase that 
much with fragmentation although they do for PASTA. Thus, UPP is highly 
robust to fragmentary data whereas PASTA is not. Hence, while PASTA 
has an advantage over UPP on datasets without fragments, UPP presents 
advantages relative to PASTA for datasets with fragments. 

To understand UPP’s performance, it is useful to consider the alignment 
strategy it uses. First, it computes a backbone alignment using PASTA on a 
relatively small (at most 1000-sequence) dataset; this allows it to begin with 
a highly accurate alignment. Then, instead of using a single prohle HMM 
to represent its backbone alignment, UPP uses a collection of prohle HMMs, 
each on a subset of the sequences. The subsets are obtained from local 
regions of the backbone tree, which is an ML tree estimated on the backbone 
sequences. Hence, the sequences in these subsets tend to be closely related. 
The induced subset alignments for these smaller localized regions are thus 
better suited for HMMs, especially when the full dataset displays overall 
substantial heterogeneity. 

These observations help explain why using multiple HMMs, each on a 
region within the backbone tree, provides improved alignments compared to 
the use of a single HMM. However, UPP also restricts the backbone to the full 
length sequences, and this algorithmic step is critical to improving robustness 
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to fragmentary sequences. Hence, these aspects of UPP’s algorithmic design 

- restricting the backbone to full length sequences and using an ensemble of 
HMMs instead of a single HMM - increase sensitivity to remote homology 
(especially for fragmentary sequences) and reduces alignment SP-error and 
tree error, but each targets a different aspect of the algorithmic performance. 

UPP exhibits great scalability with respect to running time (which scales 
in a nearly linear manner), parallelism, and alignment accuracy. For example, 
our study showed the alignment SP-error on the backbone alignment is quite 
close to the alignment SP-error on the alignment returned by UPP. Thus, 
UPP enables large datasets to be aligned nearly as accurately as smaller 
datasets. 

Overall, UPP is a multiple sequence alignment method that can provide 
very high accuracy on sequence datasets that have been considered too diffi¬ 
cult to align, including datasets that evolved with high rates of evolution, that 
contain fragmentary sequences, or that contain many thousands of sequences 

- even up to one million sequences. UPP performs well on both phylogenetic 
and structural benchmarks (see [25] for further discussion of these related 
but different tasks). Finally, UPP is parallelized (for shared memory) and 
has a checkpointing feature, but does not require supercomputers to achieve 
excellent accuracy on ultra-large datasets in reasonable timeframes. 

Methods 

Performance Study 

Data and software availability. The datasets used in this study are available 
at |26|. The github site for UPP [2^ provides open source software and 
instructions on how to download, install, and run UPP. 

Datasets.. All datasets used in our study were used in previously published 
studies, and are available online through the respective publications. Because 
UPP is designed for ultra-large scale multiple sequence alignment, we focus 
the analysis on benchmark datasets with many sequences. We used the 
following collections of simulated datasets: 

• ROSE NT: a collection of 1000-sequence nucleotide datasets from m 
that were generated using ROSE [2S|; see m for full details. 

• Indelible lOK: a collection of 10,000-sequence nucleotide datasets 
from [16] that were generated by Indelible [29]; see [I6] for full details. 
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• RNASim: a collection of datasets ranging from 10,000 sequences to 
1 ,000,000 sequences ra- 

• ROSE AA: a collection of 5000-sequence simulated AA datasets from 
[9] that were generated using ROSE. 

We also used biological datasets with reference alignments that were used in 
prior studies [121 El |22] to evaluate alignment methods on large datasets. 
We focus on datasets with 10,000 or more sequences, but also used ten large 
amino-acid datasets (8 from the BAliBASE [30] collection, and two others) 
with at least 320 sequences. 

• CRW: the three largest datasets from the Comparative Ribosomal 
Website (CRW) [21], each a set of 16S sequences. We include the 
16S.3 dataset (6,323 sequences spanning three phylogenetic domains), 
the 16S.T dataset (7,350 sequences spanning three phylogenetic do¬ 
mains), and the 16S.B.ALL dataset (27,643 sequences spanning the 
bacteria domain). The CRW datasets have highly reliable, curated 
alignments inferred from secondary and tertiary structures and were 
previously studied in mm- The reference trees on these datasets 
used in these studies were derived from maximum likelihood trees esti¬ 
mated using RAxML, with all branches with bootstrap support below 
75% collapsed. 

• 10 AA: ten amino acid datasets with curated multiple sequence align¬ 
ments (the eight largest BAliBASE datasets [30] and IGADBL_100 and 
coli_epi_100 from [3l]); these range in size from 320 to 807 sequences, 
and were used in mi to evaluate multiple sequence alignment meth¬ 
ods. The reference trees on these datasets used in these studies were 
based on RAxML with all branches with bootstrap support below 75% 
collapsed. 

• HomFam: a collection of 19 of the largest HomFam datasets, which 
are amino acid sequence datasets ranging in size from 10,099 to 93,681 
sequences with Homstrad [32] reference alignments on small subsets 
(5-20 sequences, median 7). These 19 datasets were used in [22l E] 
to evaluate multiple sequence alignment methods on large amino acid 
datasets. The study in [22] also explored performance on smaller Hom¬ 
Fam datasets, but these are not as relevant to this study. As noted 
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in na, the HomFam rhv dataset studied in [22] had a warning on the 
PFAM website that the alignment was “very weak”; for that reason, 
the rhv dataset was omitted from the study reported in na and from 
this one. 


For some of the nucleotide datasets, we generated three fragmented ver¬ 
sions, making 12.5%, 25%, and 50% of the sequences fragmentary. The 
lengths of the fragments were drawn from a normal distribution with a mean 
length of 500 bps and a standard deviation of 60 bps (mean length is one- 
third of the average length of the CRW datasets and one-half the length of 
the Indelible and ROSE NT datasets). We generated fragmentary datasets 
by selecting a random subset of sequences and a random substring (of the 
given length) for each selected sequence. 

Alignment and Tree Estimation Software. . 


Basic alignment methods 

Each dataset was aligned (when possible) using Clustal-Omega [22] ver¬ 
sion 1.2.0, MAFFT [23] version 6.956b, MUSCLE [21] version 3.8.31, and 
PASTA version 1.5.1 [Ml |I7[. MUSCLE was run with the “-maxiters 2” 
option on datasets of 3000 sequences or greater. Due to a bug in earlier 
versions of MAFFT 6.956b, MAFFT-default was run using MAFFT version 
7.143. We ran three different versions of MAFFT. MAFFT-L-INSI was run 
on datasets with 1,000 for fewer sequences. For most datasets with more than 
1,000 sequences, we ran MAFFT-default (“—auto”); the exceptions were the 
RNASim lOOK dataset, three replicates from the Indelible lOK 10000M3 
dataset, and the CRW 16S.B.ALL dataset, where MAFFT-default failed to 
run and so we used MAFFT-PartTree. All MAFFT variants included the 
“—ep 0.123” parameter. 

Because the algorithmic design parameters for running PASTA on amino 
acid datasets had not been studied, we examined different options for run¬ 
ning PASTA on amino acid datasets and used those settings in our studies 
of amino acid datasets (see SOM Section S3). PASTA was run for three 
iterations or a maximum of 24 hours, whichever came hrst. If PASTA did 
not terminate at the end of 24 hours, the alignment from the last successfully 
completed iteration were used. PASTA was run using a MAFFT-PartTree 
starting tree for all but the RNASim datasets. For the RNASim datasets, we 
used the ML tree estimated on the UPP(Fast, NoDecomp) alignment as the 
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starting tree (MAFFT-PartTree was unable to run on the largest RNASim 
datasets). The remaining settings for PASTA were set using the “—auto” 
flag. 

Commands for each method are given below: 

• Clustal-Omega: clustalo —threads=12 -i<input.sequence> -o <output.alignment> 

• MAFFT-L-INS-i: mafft - -ep 0.123 —thread 12 —localpair —maxiterate 
1000 --quiet —anysymbol <input.sequence> > <output.alignment> 

• MAFFT-default: mafft --thread 12 —ep 0.123 —auto —quiet —anysymbol 
<input.sequence> > <output.alignment> 

• MAFFT-PartTree: mafft —thread 12 —ep 0.123 —parttree —retree 2 
—partsize 1000 —quiet <input.sequence> > <output.alignment> 

• MAFFT-profile: mafft [—localpair —maxiterate 1000] [—addfragment 
I —add] <query.file> <backbone.alignment> > <output.alignment> 

• MUSCLE: muscle [-maxiters 2] -in <input.sequence> -out <output.alignment> 

• PASTA: python run.pasta.py --num-cpus=12 -o <output.directory> -i 
<input.sequences> -t <starting.tree> —auto —datatype=<molecule.type> 

• UPP: python exhaustive.upp.py -a <backbone.alignment> -t <backbone.tree> 

-s < query.sequences> -d <output.directory> -o <output.name> -x 12 

-A 10 -m <molecule.type> -c <default.config.file> 

• UPP-disjoint: python exhaustive.upp.py -S normal -a <backbone.alignment> 

-t <backbone.tree> -s <query.sequences> -d <output.directory> -o <output.name> 
-X 12 -A 10 -m <molecule.type> -c <default.config.file> 

HMMER Commands 

HMMER 3.0 [T3j was used internally within UPP for building the en¬ 
semble of HMMs (hmmbuild), for searching for the best HMM for a query 
sequence (hmmsearch), and for inserting the query sequence into the align¬ 
ment (hmmalign): 

• hmmbuild: 

hmmbuild —symfrac 0.0 —informat afa —<molecule.type> <output.profile> 

< backbone.alignment> 
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• hmmsearch: 

hmmsearch —noali -o <output^file> —cpu 1 -E 99999999 —max <inpuEprofile> 

<query^file> 

• hmmalign: 

hmmalign —allcol —dna <outpuEprofile> <query^file> <outpuEalignment> 

Maximum Likelihood Tree Estimation 

To compute maximum likelihood trees on large datasets (with 1000 or 
more sequences) we used Fast Tree [9] version 2.1.5 SSE3, and we used RAxML 
[8] version 8.0.6 for smaller datasets. We used the General Time Reversible 
(GTR) model for all the nucleotide datasets (simulated and biological) and 
JTT for the simulated amino acid datasets (ROSE AA). For the 10 AA 
datasets (all biological), we used ProtEST [33] to select the model for each 
dataset, and then used that model within RAxML to perform the analysis. 

The version number and commands used to run each method are given below. 

• FastTree A A; 

EastTreeMP -nosupport <input_fasta> > <output_tree> 

• FastTree NT: 

EastTreeMP -nosupport -nt -gtr <input_fasta> > <output_tree> 

• RAxML AA: 

raxmlHPC-T 12 -m PROP <modeLname> GAMMA -j -n <output-name> 

<startingJree> -s <input_fasta> > -w <outputMirectory> -p 1 

Performance Metrics.. We compare estimated alignments and their ML trees 
to reference alignments and trees. We use FastSP [I9] to compute SP-error 
(the average of SPFN and SPFP error) and TG scores. The SPFN rate is the 
sum-of-pairs false negative rate (which is the percentage of the homologous 
pairs in the reference alignment that are not in the estimated alignment) 
and the SPFP is the sum-of-pairs false positive rate (which is the percentage 
of homologous pairs in the estimated alignment that are not present in the 
reference alignment). 

We report tree error using the false negative (FN) rate (also known as 
the missing branch rate), which is the percentage of internal edges in the 
reference tree that are missing in the estimated tree. We also report AFN, 
the difference between the FN rate of the estimated tree and the FN rate of 
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the tree estimated on the true alignment, to evaluate the impact of alignment 
estimation on phylogenetic analysis. Most typically, AFN > 0, indicating 
that the estimated tree has higher error than the ML tree on the true align¬ 
ment, but it is possible for AFN < 0, which happens when the estimated 
ML tree is more accurate than ML on the true alignment. 
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